
Add netstacklat tool #125

Open · wants to merge 11 commits into main
Conversation

simosund
Contributor

Initial implementation of the netstacklat tool, which is based on Jesper Brouer's bpftrace script for measuring latency up to certain points in the Linux network stack.

The eBPF portion of the program is designed to be compatible with ebpf_exporter. In theory, one only has to drop the netstacklat.bpf.c file into the ebpf_exporter/examples directory and add the corresponding YAML config file (which I will probably add to this repository as well later), although I have not yet verified that it actually works as intended there.

In addition to finishing up the existing commits, some points I wish to address before attempting to merge this request are:

  • Figure out how to use vmlinux.h properly in this setup (or whether it's better to get the definition of struct sk_buff some other way)
  • Add an option to the userspace program to enable software timestamping on select interface(s) to make it a bit easier to use? Currently you're unlikely to get any useful data out of this unless you enable software timestamping on the relevant interfaces externally (as skb->tstamp will otherwise just be 0).
  • Verify that it works with ebpf_exporter and add yaml config file.
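For reference, enabling software RX timestamps from user space could look something like the sketch below. This is my own illustration of the mechanism, not code from this PR: as far as I understand, requesting SOF_TIMESTAMPING_RX_SOFTWARE on any socket makes the kernel start stamping incoming skbs (it flips a global static key), which is what the hooks here rely on.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>
#include <unistd.h>

/* Hypothetical helper (not from this PR): request software RX
 * timestamps. Enabling this on any socket should make the kernel
 * start filling in skb->tstamp on receive. Returns the socket fd
 * (which must be kept open to keep timestamping enabled), or -1 on
 * error. */
int enable_sw_rx_timestamps(void)
{
	int val = SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Note that software RX timestamping enabled this way is effectively global rather than per-interface; per-interface control is a hardware-timestamping concept.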

@tohojo tohojo requested a review from netoptimizer January 13, 2025 12:23
@simosund simosund force-pushed the netstacklat branch 2 times, most recently from 36dd2e9 to 05b8542 on January 14, 2025 10:02
@simosund
Contributor Author

So, I now believe I have a full "minimal working example" that quite closely replicates the bpftrace script and prints its output in a similar format. There are still many missing features that would be nice to have, such as:

  • Parse some command-line arguments, letting the user configure e.g. the reporting interval (currently hard-coded to 10s)
  • Automatically enable software RX timestamping (on all interfaces, or on interfaces provided as command-line arguments)
  • Automatically figure out the TAI offset at startup (currently hard-coded as 37 seconds)
  • Figure out the clock basis for skb->tstamp (currently it assumes CLOCK_REALTIME, but on at least one machine that did not seem to be the case), and possibly whether it's valid (right now any non-zero timestamp is assumed valid)
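One possible way to replace the hard-coded 37 seconds at startup (again my own sketch, not what the PR currently does) is to compare CLOCK_TAI and CLOCK_REALTIME directly:

```c
#define _GNU_SOURCE
#include <time.h>

/* Sketch: derive the REALTIME -> TAI offset by reading both clocks
 * back to back. Caveat: if the system has never been told the offset
 * (e.g. no NTP/kernel TAI configuration), CLOCK_TAI tracks
 * CLOCK_REALTIME and this returns 0, so the result still needs a
 * sanity check. */
long tai_offset_sec(void)
{
	struct timespec rt, tai;
	long long diff_ns;

	clock_gettime(CLOCK_REALTIME, &rt);
	clock_gettime(CLOCK_TAI, &tai);

	diff_ns = (tai.tv_sec - rt.tv_sec) * 1000000000LL +
		  (tai.tv_nsec - rt.tv_nsec);
	/* Round to whole seconds; the two reads happen back to back,
	 * so the sub-second parts nearly cancel. */
	return (long)((diff_ns + 500000000LL) / 1000000000LL);
}
```

This still shares the leap-second staleness problem of any startup-time initialization, of course.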

But I figure this should be sufficient for an early review at least.

One issue worth highlighting is that it's currently not fully compatible with ebpf-exporter, in the sense that you can't just drop in the netstacklat.bpf.c file. First off, there are some shared defines and enums in netstacklat.h that netstacklat.bpf.c depends on, so unless you also copy over that header (which seems to go against the style of the existing examples in ebpf-exporter/examples) you need to first insert those into the netstacklat.bpf.c file. Furthermore, netstacklat.bpf.c does #include "vmlinux_local.h" as expected for this bpf-examples repository, but the ebpf-exporter examples seem to expect #include <vmlinux.h> instead. Not sure if there's any clever pre-processor magic or other tricks you could use to solve this.

For licensing, I've put my own files under GPL-2.0 as most other examples here. However, I've also included the same version of bits.bpf.h header from bcc/libbpf-tools (same version as ebpf-exporter uses), which then retains its dual LGPL/BSD license, and maps.bpf.h from ebpf-exporter that I put under MIT as that seems to be what ebpf-exporter in general uses.

As I mentioned during the meeting, I intentionally opted to use separate BPF maps for each histogram rather than multiplex multiple histograms in the same map as most of the ebpf-exporter examples seem to do, e.g. here. By doing so I can simply use array maps instead of hashmaps and thereby reduce the overhead for map lookups a bit.

@simosund simosund marked this pull request as ready for review January 14, 2025 10:46
@simosund
Contributor Author

Oh, and ping @netoptimizer in case GitHub hasn't already done so (not intending to rush you, just want to make sure you're aware this PR is ready to be reviewed).

@simosund simosund force-pushed the netstacklat branch 2 times, most recently from ea1d185 to cf737fe on January 24, 2025 13:53
@simosund
Contributor Author

While I've done a couple of force pushes, there's been no major change to the initial 3 commits (which was what I originally requested the review for). I've just done some minor cleanup there (renamed an enum value from NETSTACK_HOOK_TCP_V4_DO_RCV to NETSTACKLAT_HOOK_TCP_V4_DO_RCV to be consistent with the others, and prefixed the commit messages with netstacklat:).

I have now also added 3 additional commits implementing most of the features I mentioned were missing in my previous comment. They are all relatively self-contained, so I can move them out into separate PR(s) and/or reorder them if requested.
There are, however, some aspects of those later commits that I would also like to get feedback on, so I've included them here for the moment (as it seems weird to start a PR for additions to another PR that hasn't been merged yet).

For the commit enabling RX software timestamps: it relies on the user-space component to do the enabling, so it won't really work when just using the netstacklat.bpf.c file together with ebpf-exporter. While @tohojo pointed out that they seem to be working on a way to enable SO_TIMESTAMPING from eBPF programs (it's a bit unclear whether that's what it actually does, and it only seems to cover the TX path for now), I don't think we can rely on that, as it will be quite some time before it makes it into most shipped kernels.

An alternative approach would be to add an eBPF program at some early point in the network stack that checks whether the skb->tstamp field is set, and sets it if not. Although I guess not all hooks are necessarily allowed to write to that field, and I'm not sure if there may be any consequences higher up the stack from setting the tstamp field when it's not expected to be set.

For the commit about initializing the TAI-offset, it has a similar limitation in that it relies on the user space program, so won't work for ebpf-exporter. As an alternative, it may be possible to get the TAI offset from an eBPF program instead. E.g. the following bpftrace script can get it:

sudo bpftrace -e '
// On kernels >= 6.13, remove this struct definition (no longer an anonymous struct)
struct tk_data {
	seqcount_raw_spinlock_t	seq;
	struct timekeeper	timekeeper;
};

BEGIN
{
	printf("Getting clock offsets from core...\n");

	$tk = (struct tk_data *)kaddr("tk_core");
	printf("MONOTONIC -> REALTIME (offs_real): %ld\n",
	       $tk->timekeeper.offs_real);
	printf("MONOTONIC -> TAI (offs_tai): %ld\n",
	       $tk->timekeeper.offs_tai);
	printf("REALTIME -> TAI (tai_offset): %ld\n",
	       $tk->timekeeper.tai_offset);

	exit();
}'

However, I'm not really sure how to translate that to C/libbpf in a good way. First, it needs to be attached to some suitable hook that ideally only runs once, before all the other hooks trigger, so that we don't have to probe that part of kernel memory every time we need the TAI offset. And I'm not sure there's anything in C/libbpf corresponding to the bpftrace kaddr() function. From a brief look, it seems to look up the /proc/kallsyms "file" and get the kernel address from there.

Furthermore, the TAI offset is currently only set at startup, so if the program is running when a leap second occurs, the TAI offset will be out of sync, and we will get massively inflated latencies. We could periodically update it, although that still runs the risk of it being out of sync in-between updates. We could also try to hook into whatever part of the kernel that actually updates that offset, so that we change it as soon as the kernel updates it. But in order to update it, it can obviously no longer be a constant, so it will potentially add a small amount of overhead when using the TAI offset.

@simosund
Contributor Author

When it comes to output format, I have for now approximately replicated the output format from @netoptimizer's bpftrace script. So it may for example look like this:

Mon Jan 27 20:02:01 2025
tcp_v4_do_rcv:
[1.02us, 2.05us)       1054 |                                                |
[2.05us, 4.1us)      181795 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us)      181529 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us)      68458 |@@@@@@@@@@@@@@@@@@                              |
[16.4us, 32.8us)       1187 |                                                |
[32.8us, 65.5us)        990 |                                                |
[65.5us, 131us)         453 |                                                |
[131us, 262us)            8 |                                                |
Sum: 2373204797

tcp_data_queue:
[2.05us, 4.1us)      169968 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
[4.1us, 8.19us)      179638 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us)      51028 |@@@@@@@@@@@@@@                                  |
[16.4us, 32.8us)       1176 |                                                |
[32.8us, 65.5us)        900 |                                                |
[65.5us, 131us)         433 |                                                |
[131us, 262us)            8 |                                                |
Sum: 2320708288

udp_queue_rcv_one_skb:
[1.02us, 2.05us)          8 |@@@@@@@@@@@@@@@@@                               |
[2.05us, 4.1us)          23 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us)          12 |@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[8.19us, 16.4us)         14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[16.4us, 32.8us)          7 |@@@@@@@@@@@@@@@                                 |
[32.8us, 65.5us)          2 |@@@@                                            |
Sum: 571657

(For reference, the corresponding output from the bpftrace script is:

20:02:01
@cnt[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@cnt[kfunc:vmlinux:tcp_data_queue]: 404016
@cnt[kfunc:vmlinux:tcp_v4_do_rcv]: 436508
@total[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@total[kfunc:vmlinux:tcp_data_queue]: 404022
@total[kfunc:vmlinux:tcp_v4_do_rcv]: 436514
@stats[kfunc:vmlinux:tcp_v4_do_rcv]: count 436515, average 5148, total 2247484837
@stats[kfunc:vmlinux:tcp_data_queue]: count 404024, average 5489, total 2217754970
@stats[kfunc:vmlinux:udp_queue_rcv_one_skb]: count 66, average 6507, total 429464

@hist_ns[kfunc:vmlinux:udp_queue_rcv_one_skb]: 
[1K, 2K)              10 |@@@@@@@@@@@@@@@@@@@@                                |
[2K, 4K)              26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)              11 |@@@@@@@@@@@@@@@@@@@@@@                              |
[8K, 16K)             13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[16K, 32K)             5 |@@@@@@@@@@                                          |
[32K, 64K)             1 |@@                                                  |

@hist_ns[kfunc:vmlinux:tcp_data_queue]: 
[1K, 2K)              25 |                                                    |
[2K, 4K)          175047 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K)          177977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)          48539 |@@@@@@@@@@@@@@                                      |
[16K, 32K)          1135 |                                                    |
[32K, 64K)           880 |                                                    |
[64K, 128K)          420 |                                                    |
[128K, 256K)           7 |                                                    |

@hist_ns[kfunc:vmlinux:tcp_v4_do_rcv]: 
[256, 512)             1 |                                                    |
[512, 1K)              4 |                                                    |
[1K, 2K)           28061 |@@@@@@@                                             |
[2K, 4K)          158756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[4K, 8K)          185669 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)          61565 |@@@@@@@@@@@@@@@@@                                   |
[16K, 32K)          1072 |                                                    |
[32K, 64K)           946 |                                                    |
[64K, 128K)          442 |                                                    |
[128K, 256K)           8 |                                                    |

)

So it "visualizes" the data in a format that makes it fairly easy to glance at, but it's not really ideal if you want to parse it and feed it into some other system. Then again, if you want to feed the data into some other system, it's perhaps better to use it together with ebpf-exporter in the first place.

@simosund
Contributor Author

simosund commented Jan 27, 2025

I have also verified that it "works" together with ebpf-exporter (with some minor changes to the includes). However, as this issue for ebpf-exporter points out, ebpf-exporter sort of has an off-by-one error in its log2 histogram implementation, so when used with Prometheus you will generally see around half of the actual latency (and this issue affects all of the log2 histogram examples currently included in ebpf-exporter).

Note that the output from netstacklat when used here in bpf-examples is still correct, as I print the correct bin labels for the histogram (acknowledging that, the way the bin index is calculated, the bins are left-inclusive). But when used with ebpf-exporter the bins will be incorrectly labeled (pretending to match Prometheus' right-inclusive bin format), and therefore give misleading results.

To clarify the issue a bit, I provide the following table:

Bin index | Bin content | Effective label | Actual Prometheus label
--------- | ----------- | --------------- | -----------------------
0         | [0, 2)      | [0, 1]          | le1
1         | [2, 4)      | (1, 2]          | le2
2         | [4, 8)      | (2, 4]          | le4
3         | [8, 16)     | (4, 8]          | le8
4         | [16, 32)    | (8, 16]         | le16
...       | ...         | ...             | ...

Here, the bin index is the index in the BPF array map for the bin (and essentially the index that will be calculated by the histogram helpers in maps.bpf.h), the bin content is the range of values that will be mapped to that bin (and the label I provide in my own output), and the effective label is what ebpf-exporter will say is stored in that bucket (with the corresponding Prometheus label, where le = less or equal).
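To make the mismatch concrete, here is a small model (my own illustration, not the exact maps.bpf.h code) of the left-inclusive power-of-two bucketing the table describes: a value v lands in bucket floor(log2(v)), so bucket i holds [2^i, 2^(i+1)), whereas a Prometheus-style "le" label on that same index promises (2^(i-1), 2^i] — off by a factor of two.

```c
/* Model of left-inclusive log2 bucketing: bucket i holds values in
 * [2^i, 2^(i+1)); bucket 0 also absorbs the values 0 and 1. Labeling
 * the same index right-inclusively halves the reported latency. */
int log2_bucket(unsigned long long v)
{
	int i = 0;

	while (v >>= 1)
		i++;
	return i;
}
```

E.g. a 7 ns sample lands in bucket 2, which the table's effective label would report as at most 4 ns.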

I'm not really sure what the best way to address this problem is. I could use the updated bin calculation from the PR that intends to fix the issue in ebpf-exporter. However, until that fix is merged into ebpf-exporter itself, that won't really change anything (as the function for calculating the bin index will be based on maps.bpf.h in the current repo).

@simosund
Contributor Author

I've now added the ebpf-exporter YAML config file.

Furthermore, in the "Make build process compatible with ebpf-exporter" commit I have provided a suggestion for how to make netstacklat.bpf.c compile correctly both here in the bpf-examples repository and when copied into the ebpf_exporter/examples directory. You will still have to also copy over the netstacklat.h file, which breaks the style compared to the other examples in ebpf_exporter/examples a bit, but it works without having to actually modify any file.

Member

@tohojo tohojo left a comment

I think generally this looks sane, with the main exception being the macros from bpf_exporter. Those we should definitely not import!

@simosund
Contributor Author

simosund commented Feb 5, 2025

I've now done a few sanity tests to validate that it seems able to capture latency increases in the network stack. I simply used 10 parallel iperf flows to generate traffic and did one test without iptables rules, and another after adding ~25k pointless iptables rules.

These first graphs are without the iptables rules.
image

These graphs are with the 25k iptables rules.
image

Note that the median and 95th percentile values should likely be around twice as high due to the issue with how ebpf-exporter bins the values that I previously mentioned.

While these graphs may not be the easiest to read, one can note that the latency increases quite a bit after adding the 25k iptables rules, with e.g. the median roughly doubling and the mean for tcp_v4_do_rcv() increasing from around 12us to 200us (the mean for tcp_data_queue() actually drops from around 12us to 6us, though). The heatmap also shows that the distribution for tcp_v4_do_rcv() shifts from a rather unimodal distribution to a much more multimodal one. Interestingly enough, the relationship between how frequently tcp_v4_do_rcv() and tcp_data_queue() run also shifts: tcp_v4_do_rcv() only runs around 15% more frequently than tcp_data_queue() without all the iptables rules, whereas it runs about twice as frequently with the 25k rules (where the rate for tcp_data_queue() seems to match the actual packet rate).

These tests, however, also somewhat highlighted that the log2 histograms are indeed very coarse-grained, as a latency increase will generally only be noticeable if it roughly doubles the latency (as that's what's required to end up in another histogram bin). And in this case the impact of adding those 25k iptables rules could, for example, be noticed much more easily by the packet rate dropping from around 260 kpps to 24 kpps. Instead of log2 histograms we could of course use something more fine-grained, but already at our current 30 buckets we exceed what's typically used with Prometheus histograms.

In the last year or so, Prometheus has added support for "native histograms" that can handle a large number of bins much more efficiently. However, that is still an experimental feature and not supported by ebpf-exporter.

@simosund
Contributor Author

simosund commented Feb 5, 2025

Have also looked a little bit at the performance. Based on the bpf stats (sysctl -w kernel.bpf_stats_enabled=1), the overhead from the eBPF probes for the current implementation looks as follows:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2093 , netstacklat_tcp_v4_do_rcv ,     93309605,           176.03,      310702.88,         5.469346%
2095 , netstacklat_tcp_data_queue,     82011162,           134.08,      273081.26,         3.661600%
2096 , netstacklat_udp_queue_rcv ,          255,           762.80,           0.85,         0.000065%
0    , Total                     ,    175321022,           156.41,      583784.99,         9.131011%

Here, count is the number of times the hook ran, avg runtime is the processing time per hook activation, frequency is how frequently the hook was triggered, and estimated CPU utilization is the single-core CPU utilization for the program given that runtime and frequency (i.e. runtime * frequency).
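The utilization column is simply the product of the two preceding columns. E.g. for the first row, 176.03 ns * 310702.88 Hz ≈ 5.47% of one core:

```c
/* Estimated single-core CPU utilization: time spent per second in the
 * hook = avg runtime per activation (converted to seconds) *
 * activations per second, expressed as a percentage. */
double est_cpu_util_pct(double avg_runtime_ns, double freq_hz)
{
	return avg_runtime_ns * 1e-9 * freq_hz * 100.0;
}
```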

So the tcp_v4_do_rcv() hook takes around 175ns per activation, while the tcp_data_queue() hook takes around 135ns per activation under load (the 762ns for udp_queue_rcv() is a bit inflated as it's not really under any load; if I redid the tests with a UDP traffic load, I expect it would drop to a similar level).

I also tested with a version where I multiplex all the histograms in a single hashmap (how ebpf-exporter typically does it, and produces a more easily handled Prometheus metric), which gave the following results:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2057 , netstacklat_tcp_v4_do_rcv ,     94031163,           266.09,      313093.05,         8.331175%
2059 , netstacklat_tcp_data_queue,     82230810,           192.18,      273801.73,         5.261953%
2060 , netstacklat_udp_queue_rcv ,          131,           832.94,           0.44,         0.000036%
0    , Total                     ,    176262104,           231.61,      586895.21,        13.593164%

So about 60ns (40%) more for tcp_data_queue() and 90ns (50%) more for tcp_v4_do_rcv().

I also tested a dummy version that just picked a fixed value for the delay, instead of reading the skb timestamp and comparing it to the current time, to get a sort of lower floor for what the maximum performance could reasonably be (only the overhead of adding the value to the histogram, using the separate array maps), which gave the following:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
276  , netstacklat_tcp_v4_do_rcv ,     89714195,            70.09,      297714.38,         2.086633%
278  , netstacklat_tcp_data_queue,     76804520,            55.17,      254873.94,         1.406174%
279  , netstacklat_udp_queue_rcv ,          379,           373.21,           1.26,         0.000047%
0    , Total                     ,    166519094,            63.21,      552589.58,         3.492854%

So just parsing the skb timestamp and getting the current time seems to make up the majority of the overhead in the current implementation.

One interesting observation here is that the overhead of the tcp_v4_do_rcv() hook (per activation) seems to be consistently higher than that of tcp_data_queue(), despite all hooks running essentially the same code and all of them being fentry hooks. Furthermore, the tcp_v4_do_rcv() hook seems to take roughly 30% longer than tcp_data_queue() in all these configurations, rather than adding a fixed overhead (e.g. 40ns).

As some reference points, I also ran similar tests with epping and the Kyanos tool; the results are as follows:
epping:

id   , name            ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
189  , pping_tc_egress ,     79354117,           583.30,      251995.04,        14.698873%
191  , pping_tc_ingress,     15509659,           829.35,       49252.10,         4.084723%
192  , tsmap_cleanup   ,       966946,           427.65,        3219.48,         0.137680%
193  , flowmap_cleanup ,         3289,           323.04,          10.95,         0.000354%
0    , Total           ,     95834011,           621.54,      304477.57,        18.921630%

kyanos:

id   , name                                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2326 , dev_hard_start_xmit                       ,     91032535,           837.27,      302098.44,        25.293666%
2333 , security_socket_recvmsg_enter             ,     37333410,            91.60,      123893.78,         1.134898%
2334 , security_socket_sendmsg_enter             ,     28416279,            87.66,       94301.60,         0.826667%
2337 , sock_alloc_ret                            ,         1737,           713.72,           5.76,         0.000411%
2338 , tcp_destroy_sock                          ,           98,          2661.88,           0.33,         0.000087%
2341 , tcp_v4_do_rcv                             ,     86890360,           658.30,      288352.32,        18.982194%
2343 , tracepoint__netif_receive_skb             ,     91034180,           771.08,      302103.90,        23.294533%
2344 , tracepoint__sched__sched_process_exec     ,           13,         12189.92,           0.04,         0.000053%
2346 , tracepoint__skb_copy_datagram_iovec       ,     94955297,           441.97,      315116.43,        13.927324%
2347 , tracepoint__syscalls__sys_enter_accept4   ,          140,          1562.58,           0.46,         0.000073%
2348 , tracepoint__syscalls__sys_enter_close     ,        43195,           444.89,         143.35,         0.006377%
2349 , tracepoint__syscalls__sys_enter_connect   ,          363,           921.93,           1.20,         0.000111%
2350 , tracepoint__syscalls__sys_enter_read      ,       105619,           660.55,         350.50,         0.023153%
2352 , tracepoint__syscalls__sys_enter_recvfrom  ,     37107108,           350.08,      123142.78,         4.311032%
2353 , tracepoint__syscalls__sys_enter_recvmsg   ,       221419,           779.13,         734.80,         0.057250%
2355 , tracepoint__syscalls__sys_enter_sendmsg   ,        71460,           829.18,         237.15,         0.019664%
2356 , tracepoint__syscalls__sys_enter_sendto    ,         1426,           699.82,           4.73,         0.000331%
2357 , tracepoint__syscalls__sys_enter_write     ,     28387953,           357.42,       94207.60,         3.367143%
2359 , tracepoint__syscalls__sys_exit_accept4    ,          140,          2785.14,           0.46,         0.000129%
2360 , tracepoint__syscalls__sys_exit_close      ,        43195,           375.93,         143.35,         0.005389%
2361 , tracepoint__syscalls__sys_exit_connect    ,          363,          2790.62,           1.20,         0.000336%
2362 , tracepoint__syscalls__sys_exit_read       ,       105620,           474.41,         350.51,         0.016628%
2364 , tracepoint__syscalls__sys_exit_recvfrom   ,     37107108,           850.75,      123142.78,        10.476430%
2365 , tracepoint__syscalls__sys_exit_recvmsg    ,       221419,          1367.78,         734.80,         0.100504%
2367 , tracepoint__syscalls__sys_exit_sendmsg    ,        71460,          1540.79,         237.15,         0.036539%
2368 , tracepoint__syscalls__sys_exit_sendto     ,         1426,          1829.03,           4.73,         0.000866%
2369 , tracepoint__syscalls__sys_exit_write      ,     28387952,           924.71,       94207.59,         8.711466%
0    , Total                                     ,    561541275,           593.46,     1863517.74,       110.593253%

So in summary, the current overhead is fairly low, but still not insignificant. Adding only a few more hook points will likely result in overhead similar to that of epping (in a setup where epping does 3-4 hashmap lookups per packet), although we still have a lot of headroom left until we reach overhead as high as kyanos (but kyanos can also deliver much more detailed data).

@simosund simosund force-pushed the netstacklat branch 2 times, most recently from 6dc4e0f to 7c05bf1 on February 12, 2025 15:46
@simosund
Contributor Author

With a bunch of help from @skyfmmf I was also able to get lattrace running (although without some hooks, which are not available on my kernel). So as an additional reference, here is a similar performance test for lattrace:

id   , name                     ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
836  , trace_alloc_skb          ,     27571636,          1124.57,      229129.33,        25.767200%
839  , tp_napi_gro_receive_entry,          900,          1759.16,           7.48,         0.001316%
840  , trace_clone_skb          ,     20855241,          1138.62,      173313.89,        19.733806%
841  , tp_consume_skb           ,     10600335,          1203.15,       88092.26,        10.598854%
842  , tp_kfree_skb             ,         1303,          2020.93,          10.83,         0.002188%
843  , trace_kfree_skb          ,     25886742,          1023.87,      215127.31,        22.026283%
846  , start_conntrack          ,     27532368,           974.68,      228803.00,        22.300972%
847  , end_conntrack            ,     27532368,           919.37,      228803.00,        21.035489%
850  , netif_receive            ,     27532563,           946.98,      228804.62,        21.667320%
851  , netdev_queue             ,     27531724,          1040.48,      228797.65,        23.806037%
852  , tcp_receive              ,     20853296,          1008.86,      173297.72,        17.483325%
853  , udp_receive              ,          112,          1420.97,           0.93,         0.000132%
0    , Total                    ,    215898588,          1027.89,     1794188.00,       184.422921%

So about ~1000ns per hook for lattrace compared to around ~150ns per hook for netstacklat (although the work lattrace actually does varies a little bit based on the hook, while netstacklat essentially runs the same code for every hook).

@simosund
Contributor Author

Hi @tohojo @netoptimizer, I've now updated this PR to address Toke's previous comments, so please re-review.

I explain my rationale regarding how I replaced the ebpf-exporter macros in the initial commit while trying to keep it as compatible as possible. I've also done some other minor changes, e.g. renamed the maps to follow the Prometheus naming conventions (as the map names are directly converted to metric names by ebpf-exporter).

@simosund
Contributor Author

I can also mention that, as a bit of a side project, I also wanted to explore whether my idea of doing a binary search to find the correct histogram bin was viable; the result can be found in my netstacklat-binarysearch branch.
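The branch has the real implementation, but the core idea can be sketched as follows (my own illustration, not the branch code): with a non-power-of-two base there is no cheap bit trick for the bucket index, so the bucket upper bounds are precomputed once and each sample is binary-searched against them.

```c
/* Sketch: find the bucket for value v given n precomputed,
 * right-inclusive bucket upper bounds (sorted ascending). Returns the
 * index of the first bound >= v; values above the last bound are
 * clamped into the last bucket. */
int search_bucket(const unsigned long long *bounds, int n,
		  unsigned long long v)
{
	int lo = 0, hi = n - 1;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (bounds[mid] < v)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

In BPF the loop would need to be bounded or unrolled, but even for 256 buckets that is only 8 iterations.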

It supports much higher resolution histograms, e.g. using the exponential base 2^(1/8) instead of 2 (corresponding to schema/scale 3 of the Prometheus/OpenTelemetry histogram bucketing scheme). So instead of:

$ sudo netstacklat --histgram-exp-base 2
tcp_data_queue:
(256ns, 512ns]       589843 |                                                |
(512ns, 1.02us]    60651018 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
(1.02us, 2.05us]    4659890 |@@@@                                            |
(2.05us, 4.1us]     1709485 |@                                               |
(4.1us, 8.19us]     2736061 |@@                                              |
(8.19us, 16.4us]    2152283 |@@                                              |
(16.4us, 32.8us]    2192421 |@@                                              |
(32.8us, 65.5us]    1127523 |@                                               |
(65.5us, 131us]       77367 |                                                |
(131us, 262us]        61477 |                                                |
(262us, 524us]        70728 |                                                |
(524us, 1.05ms]      120540 |                                                |
(1.05ms, 2.1ms]       95772 |                                                |
(2.1ms, 4.19ms]       93887 |                                                |
(4.19ms, 8.39ms]      33511 |                                                |
(8.39ms, 16.8ms]       3352 |                                                |
(16.8ms, 33.6ms]         47 |                                                |
count: 76375205, average: 12.98us

You can get something like this:

$ sudo netstacklat --histgram-exp-base 1.09050773267 --histogram-bins 256
tcp_data_queue:
(331ns, 362ns]            6 |                                                |
(362ns, 394ns]         1777 |                                                |
(394ns, 430ns]         8627 |                                                |
(430ns, 469ns]         9955 |                                                |
(469ns, 512ns]       172417 |@                                               |
(512ns, 558ns]      1126030 |@@@@                                            |
(558ns, 608ns]      4774267 |@@@@@@@@@@@@@@@@@                               |
(608ns, 663ns]      8391747 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
(663ns, 724ns]     10553781 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           |
(724ns, 789ns]     13523505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
(789ns, 861ns]     10200965 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
(861ns, 939ns]      6171181 |@@@@@@@@@@@@@@@@@@@@@@                          |
(939ns, 1.02us]     3884627 |@@@@@@@@@@@@@@                                  |
(1.02us, 1.12us]    2440684 |@@@@@@@@@                                       |
(1.12us, 1.22us]    1475033 |@@@@@                                           |
(1.22us, 1.33us]     916458 |@@@                                             |
(1.33us, 1.45us]     578992 |@@                                              |
(1.45us, 1.58us]     415995 |@                                               |
(1.58us, 1.72us]     352531 |@                                               |
(1.72us, 1.88us]     320667 |@                                               |
(1.88us, 2.05us]     298565 |@                                               |
(2.05us, 2.23us]     275586 |@                                               |
(2.23us, 2.44us]     258181 |@                                               |
(2.44us, 2.65us]     242228 |@                                               |
(2.65us, 2.9us]      239862 |@                                               |
(2.9us, 3.16us]      217436 |@                                               |
(3.16us, 3.44us]     197560 |@                                               |
(3.44us, 3.76us]     217922 |@                                               |
(3.76us, 4.1us]      257362 |@                                               |
(4.1us, 4.47us]      249781 |@                                               |
(4.47us, 4.87us]     256871 |@                                               |
(4.87us, 5.31us]     287401 |@                                               |
(5.31us, 5.79us]     307688 |@                                               |
(5.79us, 6.32us]     333881 |@                                               |
(6.32us, 6.89us]     375341 |@                                               |
(6.89us, 7.51us]     446672 |@@                                              |
(7.51us, 8.19us]     489102 |@@                                              |
(8.19us, 8.93us]     414785 |@                                               |
(8.93us, 9.74us]     327430 |@                                               |
(9.74us, 10.6us]     286621 |@                                               |
(10.6us, 11.6us]     272253 |@                                               |
(11.6us, 12.6us]     277478 |@                                               |
(12.6us, 13.8us]     284322 |@                                               |
(13.8us, 15us]       278177 |@                                               |
(15us, 16.4us]       264872 |@                                               |
(16.4us, 17.9us]     261153 |@                                               |
(17.9us, 19.5us]     277002 |@                                               |
(19.5us, 21.2us]     267395 |@                                               |
(21.2us, 23.2us]     262806 |@                                               |
(23.2us, 25.3us]     276017 |@                                               |
(25.3us, 27.6us]     291072 |@                                               |
(27.6us, 30us]       315936 |@                                               |
(30us, 32.8us]       339252 |@                                               |
(32.8us, 35.7us]     329475 |@                                               |
(35.7us, 39us]       276416 |@                                               |
(39us, 42.5us]       201996 |@                                               |
(42.5us, 46.3us]     133870 |                                                |
(46.3us, 50.5us]      86340 |                                                |
(50.5us, 55.1us]      58442 |                                                |
(55.1us, 60.1us]      40683 |                                                |
(60.1us, 65.5us]      26194 |                                                |
(65.5us, 71.5us]      16038 |                                                |
(71.5us, 77.9us]      11234 |                                                |
(77.9us, 85us]         9562 |                                                |
(85us, 92.7us]         8762 |                                                |
(92.7us, 101us]        8900 |                                                |
(101us, 110us]         8927 |                                                |
(110us, 120us]         8781 |                                                |
(120us, 131us]         9119 |                                                |
(131us, 143us]         9436 |                                                |
(143us, 156us]         9522 |                                                |
(156us, 170us]         9200 |                                                |
(170us, 185us]         8495 |                                                |
(185us, 202us]         7655 |                                                |
(202us, 220us]         6739 |                                                |
(220us, 240us]         6612 |                                                |
(240us, 262us]         6640 |                                                |
(262us, 286us]         7080 |                                                |
(286us, 312us]         7571 |                                                |
(312us, 340us]         7907 |                                                |
(340us, 371us]         8525 |                                                |
(371us, 404us]         9278 |                                                |
(404us, 441us]        10301 |                                                |
(441us, 481us]        10933 |                                                |
(481us, 524us]        11935 |                                                |
(524us, 572us]        12821 |                                                |
(572us, 623us]        14058 |                                                |
(623us, 680us]        14771 |                                                |
(680us, 741us]        15148 |                                                |
(741us, 809us]        16151 |                                                |
(809us, 882us]        16425 |                                                |
(882us, 962us]        16207 |                                                |
(962us, 1.05ms]       15798 |                                                |
(1.05ms, 1.14ms]      15109 |                                                |
(1.14ms, 1.25ms]      14756 |                                                |
(1.25ms, 1.36ms]      13631 |                                                |
(1.36ms, 1.48ms]      13644 |                                                |
(1.48ms, 1.62ms]      12638 |                                                |
(1.62ms, 1.76ms]      10497 |                                                |
(1.76ms, 1.92ms]      10261 |                                                |
(1.92ms, 2.1ms]       11232 |                                                |
(2.1ms, 2.29ms]       12180 |                                                |
(2.29ms, 2.49ms]      12824 |                                                |
(2.49ms, 2.72ms]      13276 |                                                |
(2.72ms, 2.97ms]      11857 |                                                |
(2.97ms, 3.23ms]      10195 |                                                |
(3.23ms, 3.53ms]      11539 |                                                |
(3.53ms, 3.85ms]      13821 |                                                |
(3.85ms, 4.19ms]      13919 |                                                |
(4.19ms, 4.57ms]       6205 |                                                |
(4.57ms, 4.99ms]       6264 |                                                |
(4.99ms, 5.44ms]       5116 |                                                |
(5.44ms, 5.93ms]       3677 |                                                |
(5.93ms, 6.47ms]       3522 |                                                |
(6.47ms, 7.05ms]       4314 |                                                |
(7.05ms, 7.69ms]       3087 |                                                |
(7.69ms, 8.39ms]       2182 |                                                |
(8.39ms, 9.15ms]       1058 |                                                |
(9.15ms, 9.98ms]        680 |                                                |
(9.98ms, 10.9ms]        353 |                                                |
(10.9ms, 11.9ms]        231 |                                                |
(11.9ms, 12.9ms]        132 |                                                |
(12.9ms, 14.1ms]         64 |                                                |
(14.1ms, 15.4ms]         22 |                                                |
(15.4ms, 16.8ms]         59 |                                                |
(16.8ms, 18.3ms]          3 |                                                |
(18.3ms, 20ms]           15 |                                                |
(20ms, 21.8ms]          100 |                                                |
(21.8ms, 23.7ms]         93 |                                                |
(23.7ms, 25.9ms]        132 |                                                |
(25.9ms, 28.2ms]         19 |                                                |
(28.2ms, 30.8ms]          3 |                                                |
count: 76699912, average: 13.41us

It does add some overhead, although it's not catastrophically high (performance comparison can be found in the commit message).

However, it also makes it a bit of a pain to use together with ebpf-exporter: it's the user space program that calculates and provides the bin limits the eBPF program uses, so without it you would have to configure all the bin limits manually.

This is probably not the way to go, especially as we realistically can't use that many buckets with ebpf-exporter anyway. Still, I thought it was interesting to explore, and if nothing else I learned that the verifier is actually pretty good at figuring out whether loops will terminate (there's a while loop in there whose termination is not entirely obvious).
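For reference, the kind of search such a custom-base histogram needs can be modeled in plain C roughly like this (the base, bin limits, and function names here are made up for illustration; the real eBPF version reads limits precomputed by user space, and the loop must have a provable upper bound for the verifier):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS 8

/* Hypothetical upper bin limits as user space might precompute them
 * (a real run would use e.g. base 1.0905 and 256 bins) */
static const uint64_t bin_limits[NBINS] = {
	1, 2, 4, 8, 16, 32, 64, 128
};

/* Linear search for the first bin whose upper limit holds the value;
 * values past the last limit land in the overflow (last) bin */
static size_t value_to_bin(uint64_t value)
{
	size_t i = 0;

	while (i < NBINS - 1 && value > bin_limits[i])
		i++;
	return i;
}
```

The loop is bounded by NBINS, which is what lets the verifier prove termination even though the exit condition also depends on runtime data.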

@simosund
Contributor Author

simosund commented Feb 20, 2025

In the last force push (2025-02-20T09:30 CET, commit starting at 1118029) I essentially squashed in a small change to the structure of the eBPF code that makes it easier to reuse the core functionality (recording latency until now) for other types of hooks (see commit c649f20 on the netstacklat-more-probes branch for more details).

As this is not merged yet, I figured it would be better to have the code be "correct" right away instead of having a later commit refactoring it.

@simosund
Contributor Author

Converted to draft while I prepare an initial up-to-date version of the tool that can be merged. Will reopen once that is ready for review.

Add the tool netstacklat, which measures latency up to different parts
in the Linux ingress network stack. Base the initial implementation on
a bpftrace script from Jesper Dangaard Brouer, which requires the
kernel to timestamp ingress packets (i.e. set the tstamp member of the
skb). Hence, the latency recorded by the tool is the difference
between the kernel timestamping point and various later points in the
network stack.

Track the latency at the following points:
- ip-start: Packet has reached the IP stack (past the tc layer).
  Hooks: fentry:ip_rcv_core, fentry:ip6_rcv_core
- tcp-start: Packet has reached the local TCP stack (done with all
  IP processing).
  Hooks: fentry:tcp_v4_rcv, fentry:tcp_v6_rcv
- udp-start: Packet has reached the local UDP stack (done with all
  IP processing).
  Hooks: fentry:udp_rcv, fentry:udpv6_rcv
- tcp-socket-enqueued: Packet has been enqueued to a TCP socket.
  Hooks: fexit:tcp_data_queue
- udp-socket-enqueued: Packet has been enqueued to a UDP socket.
  Hooks: fexit:udp_queue_rcv_one_skb, fexit:udpv6_queue_rcv_one_skb
- tcp-socket-read: Application has read packet data from a TCP socket.
  Hooks: fentry:tcp_recv_timestamp
- udp-socket-read: Application has read packet data from a UDP socket.
  Hooks: fentry:skb_consume_udp
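As a rough illustration, one of these hooks could look something like the following sketch (the TAI_OFFSET_NS constant and the record_latency()/HOOK_* names are hypothetical, not the actual netstacklat code):

```c
SEC("fexit/tcp_data_queue")
int BPF_PROG(netstacklat_tcp_data_queue, struct sock *sk, struct sk_buff *skb)
{
	/* skb->tstamp holds the kernel's RX timestamp (CLOCK_REALTIME
	 * basis), so derive "now" on the same basis from CLOCK_TAI */
	u64 now = bpf_ktime_get_tai_ns() - TAI_OFFSET_NS;

	if (skb->tstamp > 0)
		record_latency(now - skb->tstamp, HOOK_TCP_SOCKET_ENQUEUED);
	return 0;
}
```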

Only include a minimal user space loader, which loads and attaches the
eBPF programs, in this initial commit. Later commits will extend this
loader to also report the recorded latencies stored in the BPF maps
and allow it to modify behavior of the eBPF programs.

Make the eBPF programs compatible with Cloudflare's ebpf-exporter [1],
so that statistics can easily be exported as Prometheus metrics. As
part of this, name the BPF maps used to store the histograms according
to Prometheus naming conventions [2].

However, as ebpf-exporter's macros [3] for storing values in
histograms rely on expanding multiple levels of macros, reimplement
the needed functionality from these macros (specifically
increment_exp2_histogram_nosync) as our own functions to reduce the
risk of running into subtle and hard-to-debug issues.

Keep the signature of the increment_exp2_histogram_nosync() function
identical to the corresponding ebpf-exporter macro, so that the
function can be swapped for the ebpf-exporter macro without changing
any call sites (e.g. if the macros gain some debugging
functionality). Notably, this requires passing the key struct by
value, which is otherwise somewhat unnecessary for the trivial key
used by netstacklat.

Note that the increment_exp2_histogram_nosync() function here only
implements the functionality needed by netstacklat. This means it
only works for ARRAY maps (it assumes all keys already exist, so it
does not zero-initialize them as needed for HASH maps), and only for
a single key type (struct hist_key).
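The exp2 bucketing logic such a function needs can be modeled in plain C along these lines (a sketch following ebpf-exporter's histogram layout as I understand it, with the sum of all samples kept in the slot after the last bucket; the names, bucket count, and plain-array storage are illustrative stand-ins for the BPF map operations):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BUCKET 31 /* hypothetical bucket count for this sketch */

/* Slots 0..MAX_BUCKET hold counts; slot MAX_BUCKET + 1 accumulates
 * the sum of all values so an average can be derived */
static uint64_t hist[MAX_BUCKET + 2];

/* floor(log2(v)) for v > 0 */
static uint32_t ulog2(uint64_t v)
{
	uint32_t r = 0;

	while (v >>= 1)
		r++;
	return r;
}

static void increment_exp2_histogram_nosync(uint64_t value)
{
	uint32_t bucket = value > 0 ? ulog2(value) : 0;

	if (bucket > MAX_BUCKET)
		bucket = MAX_BUCKET; /* overflow bucket */
	hist[bucket]++;
	hist[MAX_BUCKET + 1] += value;
}
```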

Instead of placing the functions for storing values in histograms in
our own maps.bpf.h header, place them directly in the
netstacklat.bpf.c source. This ensures that if moved to the
ebpf-exporter repository the program will still use these functions
rather than the ebpf-exporter macros, preventing any unforeseen issues
due to macro expansion.

Open code the histogram maps for different hook points as entirely
separate maps, instead of encoding the different hook points in the
key of a single map as ebpf-exporter often does. This allows using
fast array map lookups instead of costly hash map lookups.

[1] https://github.com/cloudflare/ebpf_exporter
[2] https://prometheus.io/docs/practices/naming
[3] https://github.com/cloudflare/ebpf_exporter/blob/master/examples/maps.bpf.h

Signed-off-by: Simon Sundberg <[email protected]>
Add functionality to the user space component to periodically fetch
the BPF maps netstacklat records the values in and print them out.

Base the core program loop on the same setup as pping, where a single
epoll instance is used to support multiple different types of
events. So far it only deals with signal handling (for clean shutdown)
plus a timer (for periodical reporting), but the setup can easily be
extended if the program grows more complex in the future.

Use the (somewhat complicated) bpf_map_lookup_batch to fetch the
entire histogram maps in a single system call (instead of performing a
lookup for each bin index individually).
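A sketch of what such a batched fetch might look like with libbpf (the function name and __u64 value layout are assumptions for illustration; a real loader would additionally have to handle partial batches):

```c
#include <bpf/bpf.h>
#include <errno.h>

/* Fetch every bin of one histogram ARRAY map with a single
 * bpf_map_lookup_batch() syscall */
static int fetch_histogram(int map_fd, __u64 *bins, __u32 nbins)
{
	__u32 keys[nbins];
	__u32 batch, count = nbins;
	int err;

	err = bpf_map_lookup_batch(map_fd, NULL, &batch, keys, bins,
				   &count, NULL);
	/* -ENOENT simply signals that the end of the map was reached */
	if (err && err != -ENOENT)
		return err;
	if (count != nbins)
		return -EINVAL;
	return 0;
}
```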

Signed-off-by: Simon Sundberg <[email protected]>
Add support for parsing arguments from the command line. Add the
--help/-h and --report-interval/-r <n-seconds> options. As part of
this, add some minor documentation about the program and the available
options that are displayed by the --help option.

Base the print_usage() function on the usage() function from the
traffic-pacing-edt example, but change its output format to be more
consistent with most GNU utilities, listing the short option (if any)
before the long option. Furthermore, also show if an argument is
expected based on the has_arg value.

Automatically generate the optstring that getopt_long() expects from
the options array. While needlessly complicated for the two options
currently supported, it simplifies adding additional options in the
future.

To allow mapping the opt value returned by getopt_long() back to the
corresponding option struct (for example, to access the full name of
the option), add the optval_to_longopt() helper, which linearly
searches through the options array. This is more reliable than using
longindex pointer from getopt_long(), as that pointer is only set when
a long option is passed, and not when the corresponding short option
is passed.
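The optstring generation can be sketched roughly like this (function and variable names are illustrative, not the actual netstacklat code, and the demo options are hypothetical examples):

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>
#include <string.h>

/* Build the optstring that getopt_long() expects from the
 * long-options array itself, so the short options can never go out
 * of sync with it */
static void generate_optstring(const struct option *opts, char *buf,
			       size_t buflen)
{
	size_t i = 0;

	for (const struct option *o = opts; o->name; o++) {
		if (o->val < 'A' || o->val > 'z')
			continue; /* no short-option variant */
		if (i + 3 >= buflen)
			break;
		buf[i++] = (char)o->val;
		if (o->has_arg == required_argument)
			buf[i++] = ':';
	}
	buf[i] = '\0';
}

/* Hypothetical options array for demonstration */
static const struct option demo_options[] = {
	{ "help",            no_argument,       NULL, 'h' },
	{ "report-interval", required_argument, NULL, 'r' },
	{ 0 }
};
```

For the two demo options above this produces the optstring "hr:", with the ':' appended automatically because --report-interval takes an argument.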

Signed-off-by: Simon Sundberg <[email protected]>
The netstacklat tool relies on the tstamp member of the skbs being
set by the kernel. Therefore, make the user space loader enable
software RX timestamping on all packets when the tool is started.

Note that, as the documentation at
https://docs.kernel.org/networking/timestamping.html points out,
enabling software receive timestamping on a socket enables it for all
packets, as the destination socket is not known early in the network
stack. It is therefore sufficient to enable it on a socket created by
the program itself, even if no packets are received on that socket.
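A minimal sketch of enabling software RX timestamping this way (the function name is made up, and error handling is kept minimal):

```c
#include <assert.h>
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_TIMESTAMPING
#define SO_TIMESTAMPING 37 /* SO_TIMESTAMPING_OLD */
#endif

/* Enable software RX timestamping; since the destination socket is
 * not known early in the stack, this makes the kernel timestamp all
 * incoming packets, even though this dummy socket receives nothing.
 * Returns the socket fd (keep it open for the program's lifetime),
 * or -1 on failure. */
static int enable_sw_rx_timestamps(void)
{
	int val = SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
	int sock = socket(AF_INET, SOCK_DGRAM, 0);

	if (sock < 0)
		return -1;
	if (setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val,
		       sizeof(val)) < 0) {
		close(sock);
		return -1;
	}
	return sock;
}
```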

Signed-off-by: Simon Sundberg <[email protected]>
Make the user space loader calculate the TAI-offset at startup and set
it as a constant for the eBPF programs. Split the open and loading
stages of the eBPF programs apart to enable setting constants in the
eBPF programs.

Note that on some systems (e.g. most Debian systems by default), the
TAI offset may (incorrectly) be 0, so that CLOCK_TAI becomes identical
to CLOCK_REALTIME. While this is incorrect in principle, it does not pose
an issue for netstacklat, as it only needs the TAI offset to translate
CLOCK_TAI to CLOCK_REALTIME (which skb->tstamp is assumed to use as
clock basis). Therefore, netstacklat will (from this commit) work
correctly even if the TAI offset is not correctly set on the system.

A notable limitation of this approach is that the TAI offset is only
set once, when the program is first loaded. It is not dynamically
updated if the TAI offset changes, so if the program is running while
a leap second occurs, the recorded latencies may be off by one
second. Furthermore, as the TAI offset is set from user space, this
mechanism will not work when using just the eBPF programs together
with ebpf-exporter (which will then simply have to rely on the
hardcoded value in netstacklat.bpf.c).

Signed-off-by: Simon Sundberg <[email protected]>
Add a README briefly explaining what this example does.

Signed-off-by: Simon Sundberg <[email protected]>
Add a YAML configuration that ebpf-exporter can use to convert the BPF
maps to Prometheus metrics. Convert the histogram map for each hook
into the corresponding Prometheus histogram.

Also update the README to explain how to use netstacklat together with
ebpf-exporter.

Signed-off-by: Simon Sundberg <[email protected]>
Add the -l/--list-probes option, listing the names of all possible
hooks together with a short explanation of the hook.

Signed-off-by: Simon Sundberg <[email protected]>
Add the -e/--enable-probes and -d/--disable-probes options, letting
the user control which hooks should be enabled.

Make the --enable-probes option enable the specified hooks and disable
all others. Similarly, make the --disable-probes option disable the
specified hooks and enable all others. Make these two options mutually
exclusive.

Signed-off-by: Simon Sundberg <[email protected]>
Since kernel v5.18 the skb->tstamp value for ingress packets may
contain a preserved TX timestamp to later be used on the egress path,
e.g. for EDT, see [1]. Using these TX timestamps to calculate latency
on the ingress path results in erroneous values being reported, and
due to these TX timestamps often using CLOCK_MONOTONIC instead of
CLOCK_REALTIME the error is often years of latency.

To prevent these erroneous values, check whether skb->tstamp is a TX
or RX timestamp, and ignore skbs with TX timestamps.

How to check if skb->tstamp is a TX timestamp depends on kernel
version. There are 3 cases to consider:
- Kernel < v5.18: skb->tstamp is always reset on ingress, and will
  never be a TX timestamp.
- v5.18 <= kernel < v6.11: If the skb->mono_delivery_time field is
  set, skb->tstamp is a preserved TX timestamp.
- Kernel >= v6.11: The skb->mono_delivery_time field has been renamed
  to skb->tstamp_type [2] to support additional clock IDs. If it is
  set to a non-zero value, skb->tstamp is a TX timestamp.

As we rely on the bpf_ktime_get_tai_ns() helper, which was introduced
in v6.1, we can ignore the first case, as kernels < v5.18 will be
unsupported anyway. To handle the renaming of the field in v6.11,
utilize CO-RE functionality to determine if the skb->tstamp_type
field exists. Use bpf_core_field_exists() together with "overloading"
the sk_buff definition using the triple-underscore ignored suffix
rule, as explained in [3].

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
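Based on that reference guide, the check might be sketched as follows (the flavor struct and helper name are illustrative, not the actual netstacklat code):

```c
/* Local "flavor" of sk_buff with the pre-v6.11 field name; libbpf
 * strips the ___old suffix when matching against the kernel's BTF */
struct sk_buff___old {
	__u8 mono_delivery_time: 1;
} __attribute__((preserve_access_index));

static bool skb_has_tx_tstamp(const struct sk_buff *skb)
{
	if (bpf_core_field_exists(struct sk_buff, tstamp_type))
		/* v6.11+: any non-zero tstamp_type means a TX timestamp */
		return BPF_CORE_READ_BITFIELD_PROBED(skb, tstamp_type) != 0;

	/* v5.18 - v6.10: the mono_delivery_time bit marks TX timestamps */
	return BPF_CORE_READ_BITFIELD_PROBED((struct sk_buff___old *)skb,
					     mono_delivery_time) != 0;
}
```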

Signed-off-by: Simon Sundberg <[email protected]>
Add the -p/--pid-filter option, which allows the user to provide a
comma-separated list of PIDs for applications that the output should
be filtered for.

Only filter hooks running in the process context (currently only the
socket-read hooks), as the receiving application is generally not
known in the earlier parts of the stack. To also filter the earlier
hooks would require keeping track of all connections, so that the
skbs (based on the 5-tuple) can be matched to the receiving
application, and both maintaining this state and looking it up would
add considerable overhead.

To minimize overhead, implement the pid matching as a simple array map
lookup. As the maximum possible PID value is 4M, a full array map
containing every possible PID value takes up around 32MB.

Note that currently no more than 4096 PIDs may be filtered for. This
is only a limit on how many PIDs the user space agent can store when
parsing the PID list, and it can easily be extended, as the BPF side
can handle up to 4M PIDs.
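A sketch of what the BPF side of such a filter could look like (the map and function names are made up for illustration):

```c
#define PID_MAX_LIMIT (4 * 1024 * 1024) /* kernel's maximum PID value */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, PID_MAX_LIMIT);
	__type(key, __u32);
	__type(value, __u8);
} netstack_pidfilter SEC(".maps");

/* O(1) check of whether the current task's PID is being filtered for;
 * called only from hooks running in process context */
static bool filter_current_pid(void)
{
	__u32 tgid = bpf_get_current_pid_tgid() >> 32;
	__u8 *enabled = bpf_map_lookup_elem(&netstack_pidfilter, &tgid);

	return enabled && *enabled;
}
```

With 8-byte-aligned values for every possible key, this is where the roughly 32MB (4M * 8 bytes) of the full array map comes from.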

Signed-off-by: Simon Sundberg <[email protected]>
@simosund simosund marked this pull request as ready for review April 24, 2025 18:16
@simosund
Contributor Author

Hi @tohojo, @netoptimizer. I've now updated this PR to contain an up-to-date version of netstacklat with all the features that I consider ready to be merged. Please review when you have time.

This excludes the tracking of standing queues (which is still rather experimental and, based on my own testing, doesn't work all that well), and the rework to multiplex all histograms in a single hash map instead of separate array maps (which still needs some more refinement and a more extensive evaluation of overhead). These can be found in my branches netstacklat-queue-detection and netstacklat-multiplex-histograms (I have not yet rebased them on top of this updated branch).
